Predicting Diabetes

Demi van den Biggelaar (9660089)
August Gesthuizen (5292565)
Friso Harff (7526946)
Leander van der Waal (7180063)

2026-01-15

Table of contents - Leander

  • The data
  • Our research question
  • Models
  • The best model
  • Assumptions
  • Model conclusion
  • Answer to the research question

The Dataset

National Institute of Diabetes and Digestive and Kidney Diseases (1990)

Explaining the data - Leander

  • Pregnancies: Number of times pregnant

  • Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test

  • BloodPressure: Diastolic blood pressure (mm Hg)

  • SkinThickness: Triceps skin fold thickness (mm)

  • Insulin: 2-Hour serum insulin (mu U/ml)

  • BMI: Body mass index (weight in kg/(height in m)^2)

  • DiabetesPedigreeFunction: Diabetes pedigree function

  • Age: Age (years)

  • Outcome: Class variable (0 or 1)

Reading in the data - Leander

# Read the raw CSV and cache it as RDS
datacsv <- read.csv("./data/raw/diabetes.csv", header = TRUE)
saveRDS(datacsv, "./data/raw/dataset.rds")

# Preprocessing
dataRDS <- readRDS("./data/raw/dataset.rds")
dataRDS <- na.omit(dataRDS) # drops every row that contains an NA in any column

# Removing missing values (recorded as 0 in these columns).
# This has to run before the BMI recoding: otherwise a BMI of 0 is
# recoded to class 1 and slips past the BMI != 0 filter.
dataRDS <- subset(dataRDS,
                  BloodPressure != 0 &
                  SkinThickness != 0 &
                  Glucose != 0 &
                  Insulin != 0 &
                  BMI != 0)

# Recoding BMI to classes (TODO: note why we do this)
dataRDS$BMI[dataRDS$BMI <= 18.5] <- 1                     # Underweight
dataRDS$BMI[dataRDS$BMI > 18.5 & dataRDS$BMI <= 25] <- 2  # Healthy
dataRDS$BMI[dataRDS$BMI > 25 & dataRDS$BMI <= 30] <- 3    # Overweight
dataRDS$BMI[dataRDS$BMI > 30] <- 4                        # Obese

# Save only after all preprocessing, so the stored file matches dataRDS
saveRDS(dataRDS, "./data/processed/dataset.rds")
dataset <- readRDS("./data/processed/dataset.rds")

Interesting findings - Leander

Research Questions - Leander

Can we predict whether someone has diabetes from the variables described above, using logistic regression?

Choosing and explaining the best model (explain what the best model is, what it does, and why it is the best) - August

Note: do not forget to mention the moderation effects, and consider linking them back to the pairs table above. The models can also still be adjusted to make the story clearer.

Assumptions: which assumptions, and why they are met - Friso

Binary dependent variable

  • Diabetes (coded 1)
  • No diabetes (coded 0)

Sufficiently large sample size

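  • At least 10 cases per candidate predictor: \(N=\frac{10k}{p}\), where \(k\) is the number of candidate predictors and \(p\) is the proportion of the least frequent outcome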
  Outcome   n      prop
1       0 263 0.6692112
2       1 130 0.3307888
[1] 61
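
The rule of thumb \(N=\frac{10k}{p}\) can be checked by hand. The following is a minimal sketch, not the code behind the output above: the helper name `min_sample_size` is hypothetical, the class counts are taken from the outcome table, and `k = 2` is an assumed value for illustration, since the slides do not state how many candidate predictors were counted.

```r
# Minimum sample size under the "10 cases per candidate predictor" rule.
# k: number of candidate predictors
# p: proportion of the least frequent outcome class
min_sample_size <- function(k, p) {
  stopifnot(k >= 1, p > 0, p <= 0.5)
  ceiling(10 * k / p)  # round up to a whole number of cases
}

# Class counts from the outcome table above
counts <- c(no_diabetes = 263, diabetes = 130)
p_min <- min(counts) / sum(counts)  # 130 / 393

# k = 2 is an assumption made for illustration only
n_required <- min_sample_size(k = 2, p = p_min)
n_required                 # 61
sum(counts) >= n_required  # TRUE
```

Under that assumed `k`, the 393 available cases comfortably exceed the required minimum; a larger `k` raises the requirement proportionally.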

Full-rank predictor matrix

  • More observations than predictors

  • No multicollinearity among linear predictors

                     Age                  Glucose                      BMI 
                1.022157                 1.018869                 1.009314 
DiabetesPedigreeFunction 
                1.006276 
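
Values close to 1, as above, indicate that each predictor is barely explained by the others. The slides do not show how these values were computed (a function such as `car::vif` is a common choice, but that is an assumption). As a rough sketch of the textbook definition for linear predictors, \(\mathrm{VIF}_j = \frac{1}{1-R_j^2}\), the hypothetical helper `vif_manual` below regresses each predictor on the remaining ones. It runs on simulated data, not on the diabetes dataset.

```r
# Variance inflation factor for each predictor, computed by hand:
# regress predictor j on the remaining predictors and take 1 / (1 - R^2).
vif_manual <- function(predictors) {
  sapply(names(predictors), function(name) {
    others <- setdiff(names(predictors), name)
    fit <- lm(reformulate(others, response = name), data = predictors)
    1 / (1 - summary(fit)$r.squared)
  })
}

# Simulated stand-in data (NOT the diabetes dataset): three independent
# predictors plus one that is nearly a copy of x1.
set.seed(1)
n <- 200
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
sim$x4 <- sim$x1 + rnorm(n, sd = 0.1)

vif_independent <- vif_manual(sim[, c("x1", "x2", "x3")])  # all close to 1
vif_collinear <- vif_manual(sim)                           # x1 and x4 strongly inflated
round(vif_independent, 2)
round(vif_collinear, 2)
```

For a fitted logistic regression, packaged implementations may derive the factors from the coefficient covariance matrix instead, so their numbers need not match this hand calculation exactly.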

Assumptions: removing outliers, and showing on a case-by-case basis that they do not matter - Friso

Exploring the model: discuss the model using the research method from the slides, INCLUDING CONFUSION MATRIX - Demi
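
A confusion matrix for a logistic regression could be produced along the following lines. This is only a sketch under stated assumptions: the data are simulated (they are not the diabetes dataset and only borrow two of its column names), the coefficients used to generate the outcome are made up, and the 0.5 classification cut-off is an arbitrary choice rather than the one used in this project.

```r
# Simulated stand-in data (NOT the diabetes dataset); the column names
# only mirror two of the variables described earlier.
set.seed(42)
n <- 400
sim <- data.frame(Glucose = rnorm(n, mean = 120, sd = 30),
                  Age = round(runif(n, min = 21, max = 80)))

# Hypothetical true relationship used to generate the outcome
linear_predictor <- -6 + 0.04 * sim$Glucose + 0.02 * sim$Age
sim$Outcome <- rbinom(n, size = 1, prob = plogis(linear_predictor))

# Logistic regression: binomial family with the default logit link
fit <- glm(Outcome ~ Glucose + Age, data = sim, family = binomial)

# Predicted probabilities, classified at a 0.5 cut-off (an assumption)
prob <- predict(fit, type = "response")
predicted <- factor(as.integer(prob >= 0.5), levels = c(0, 1))
actual <- factor(sim$Outcome, levels = c(0, 1))

# Confusion matrix: rows are predictions, columns are observed outcomes
confusion <- table(Predicted = predicted, Actual = actual)
confusion

# Summary measures derived from the matrix
accuracy <- sum(diag(confusion)) / sum(confusion)
sensitivity <- confusion["1", "1"] / sum(confusion[, "1"])
specificity <- confusion["0", "0"] / sum(confusion[, "0"])
```

Because the matrix here is computed on the same data the model was fitted on, the resulting accuracy is optimistic; a held-out test set would give a fairer picture.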

Conclude and answer - Demi
